Hypertext Classification Diploma thesis Hervé
Authors
Abstract
Web directories have historically been compiled and updated by hand, but this method is unsatisfactory for three reasons. First, a team of Web surfers maintaining such a database must cope with the sheer size of the Web, and a team large enough for the task would be incompatible with the economic constraints of a startup. Second, even the biggest team could not trace every change on the Web and keep the database up to date. Third, categorization is a highly subjective task: manual categorization is not a synonym for good categorization. Automating the categorization of documents is, however, a difficult task in the Web environment. The diversity of languages, topics and authorship prevents traditional classification algorithms from working optimally. Fortunately, the internal HTML structure of Web pages and the hyperlink graph structure of the Web are new sources of information that can be exploited to improve automated Web page classification. In this diploma thesis, we continue the work of Fürnkranz [8] on hypertext categorization. We investigate different classification techniques for categorizing hypertext documents. We target information-rich text areas of the page and of its neighbors, and we compare different methods for making these various features work together to improve classification. We evaluate the strengths and weaknesses of the Hyperlink Ensembles and Meta Predecessor approaches. We explain how to choose a binarization algorithm, Round Robin or One Against All, according to the expected behavior. We compare two solutions for bringing together features mined from different locations, namely Tagging and Merging, and we finally propose a model of hypertext classifier that combines the best characteristics of the methods we study. Our main result is a hyperlink-based classifier model that outperforms a text-only classifier by almost 25% on the WebKB dataset.
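The two binarization schemes named in the abstract differ in how they split a multi-class problem into binary ones. The sketch below assumes the usual definitions: One Against All builds one binary problem per class (that class versus all the others), while Round Robin builds one binary problem per unordered pair of classes. The function names and the class labels are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
from itertools import combinations


def one_against_all(classes):
    """Return one binary problem per class: (positive classes, negative classes)."""
    return [((c,), tuple(o for o in classes if o != c)) for c in classes]


def round_robin(classes):
    """Return one binary problem per unordered pair of classes."""
    return [((a,), (b,)) for a, b in combinations(classes, 2)]


# Hypothetical class labels, used only to make the example concrete.
classes = ["course", "faculty", "project", "student"]

oaa = one_against_all(classes)
rr = round_robin(classes)

# With c classes, One Against All yields c problems,
# Round Robin yields c * (c - 1) / 2 problems.
print(len(oaa), len(rr))
```

With c classes, One Against All trains c binary classifiers, each on all the training data, whereas Round Robin trains c(c-1)/2 classifiers, each on the examples of only two classes.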
Similar resources
A Distributed Education Environment Based on Mathematica
This paper presents the concept of an environment for distributed mathematical education based on the computer algebra system Mathematica. Mathematica is a "system for doing mathematics by computer" which combines a computational engine with powerful presentation and visualization facilities based on the concept of notebooks (multimedia hypertext documents). The system has gained large popula...
Lehrstuhl für Effiziente Algorithmen Diploma Thesis in Informatics Automata-based IP Packet Classification
AMS MSC: 68Q45 Formal languages and automata, 68M12 Network protocols. Declaration: "I hereby declare that this thesis is the result of my own work and includes nothing which is the outcome of work done in collaboration unless stated otherwise."
Link-Local Features for Hypertext Classification
Previous work in hypertext classification has resulted in two principal approaches for incorporating information about the graph properties of the Web into the training of a classifier. The first approach uses the complete text of the neighboring pages, whereas the second approach uses only their class labels. In this paper, we argue that both approaches are unsatisfactory: the first one brings...
Decomposition of Polynomials
This diploma thesis is concerned with functional decomposition f = g ◦ h of polynomials. First an algorithm is described which computes decompositions in polynomial time. This algorithm was originally proposed by Zippel (1991). A bound for the number of minimal collisions is derived. Finally a proof of a conjecture in von zur Gathen, Giesbrecht & Ziegler (2010) is given, which states a classifi...
Unequal Error Protection Turbo Codes Diploma Thesis
I affirm that I wrote the Diploma Thesis by my self and that I did not use other than the indicated sources and resources.
Publication date: 2005